Abstract

Supporting natural language input may improve learning in intelligent
tutoring systems. However, interpretation errors are unavoidable and
require an effective recovery policy. We describe an evaluation of an
error recovery policy in the Beetle II tutorial dialogue system and
discuss how different types of interpretation problems affect learning
gain and user satisfaction. In particular, the problems arising from
student use of non-standard terminology appear to have negative
consequences. We argue that existing strategies for dealing with
terminology problems are insufficient and that improving such strategies
is important in future ITS research.

1 Introduction

There is a mounting body of evidence that student self-explanation and
contentful talk in human-human tutorial dialogue are correlated with
increased learning gain (Chi et al., 1994; Purandare and Litman, 2008;
Litman et al., 2009). Thus, computer tutors that understand student
explanations have the potential to improve student learning (Graesser
et al., 1999; Jordan et al., 2006; Aleven et al., 2001; Dzikovska et al.,
2008). However, understanding and correctly assessing the student’s
contributions is a difficult problem due to the wide range of variation
observed in student input, and especially due to students’ sometimes
vague and incorrect use of domain terminology.

Many tutorial dialogue systems limit the range of student input by
asking short-answer questions. This provides a measure of robustness,
and previous evaluations of ASR in spoken tutorial dialogue systems
indicate that neither word error rate nor concept error rate in such
systems affects learning gain (Litman and Forbes-Riley, 2005; Pon-Barry
et al., 2004). However, limiting the range of possible input limits the
contentful talk that the students are expected to produce, and therefore
may limit the overall effectiveness of the system.

Most of the existing tutoring systems that accept unrestricted language
input use classifiers based on statistical text similarity measures to
match student answers to open-ended questions with pre-authored
anticipated answers (Graesser et al., 1999; Jordan et al., 2004; McCarthy
et al., 2008). While such systems are robust to unexpected terminology,
they provide only a very coarse-grained assessment of student answers.
Recent research aims to develop methods that produce detailed analyses of
student input, including correct, incorrect and missing parts (Nielsen
et al., 2008; Dzikovska et al., 2008), because the more detailed
assessments can help tailor tutoring to the needs of individual students.

While the detailed assessments of answers to open-ended questions are
intended to improve potential learning, they also increase the
probability of misunderstandings, which negatively impact tutoring and
therefore negatively impact student learning (Jordan et al., 2009). Thus,
appropriate error recovery strategies are crucially important for
tutorial dialogue applications. We describe an evaluation of an
implemented tutorial dialogue system which aims to accept unrestricted
student input and limit misunderstandings by rejecting low confidence
interpretations and employing a range of error recovery strategies
depending on the cause of interpretation failure.

By comparing two different system policies, we demonstrate that with
less restricted language input the rate of non-understanding errors
impacts both learning gain and user satisfaction, and that problems
arising from incorrect use of terminology have a particularly negative
impact. A more detailed analysis of the results indicates that, even
though we based our policy on an approach effective in task-oriented
dialogue (Hockey et al., 2003), many of our strategies were not
successful in improving learning gain. At the same time, students appear
to be aware that the system does not fully understand them even if it
accepts their input without indicating that it is having interpretation
problems, and this is reflected in decreased user satisfaction. We argue
that this indicates that we need better strategies for dealing with
terminology problems, and that accepting non-standard terminology without
explicitly addressing the difference in acceptable phrasing may not be
sufficient for effective tutoring.

In Section 2 we describe our tutoring system, and the two tutoring
policies implemented for the experiment. In Section 3 we present
experimental results and an analysis of correlations between different
types of interpretation problems, learning gain and user satisfaction.
Finally, in Section 4 we discuss the implications of our results for
error recovery policies in tutorial dialogue systems.

2 Tutorial Dialogue System and Error Recovery Policies

This work is based on evaluation of Beetle II (Dzikovska et al., 2010),
a tutorial dialogue system which provides tutoring in basic electricity
and electronics. Students read pre-authored materials, experiment with a
circuit simulator, and then are asked to explain their observations.
Beetle II uses a deep parser together with a domain-specific diagnoser to
process student input, and a deep generator to produce tutorial feedback
automatically depending on the current tutorial policy. It also
implements an error recovery policy to deal with interpretation problems.

Students currently communicate with the system via a typed chat
interface. While typing removes the uncertainty and errors involved in
speech recognition, expected student answers are considerably more
complex and varied than in a typical spoken dialogue system. Therefore, a
significant number of interpretation errors arise, primarily during the
semantic interpretation process. These errors can lead to
non-understandings, when the system cannot produce a syntactic parse (or
a reasonable fragmentary parse), or when it does not know how to
interpret an out-of-domain word; and misunderstandings, where a system
arrives at an incorrect interpretation, due to either an incorrect
attachment in the parse, an incorrect word sense assigned to an ambiguous
word, or an incorrectly resolved referential expression.

Our approach to selecting an error recovery policy is to prefer
non-understandings to misunderstandings. There is a known trade-off in
spoken dialogue systems between allowing misunderstandings, i.e., cases
in which a system accepts and acts on an incorrect interpretation of an
utterance, and non-understandings, i.e., cases in which a system rejects
an utterance as uninterpretable (Bohus and Rudnicky, 2005). Since
misunderstandings on the part of a computer tutor are known to negatively
impact student learning, and since in human-human tutorial dialogue the
majority of student responses using unexpected terminology are classified
as incorrect (Jordan et al., 2009), it would be a reasonable approach for
a tutorial dialogue system to deal with potential interpretation problems
by treating low-confidence interpretations as non-understandings and
focusing on an effective non-understanding recovery policy.1
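
The preference for non-understandings over misunderstandings can be
pictured as a simple acceptance gate. The sketch below is illustrative
only and is not the Beetle II implementation: the names Interpretation
and accept_interpretation, the 0-to-1 score range and the threshold value
are all assumptions introduced here. The paper states only that a parse
quality score and a set of consistency checks are combined to decide
whether an interpretation is sufficiently reliable.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical threshold; the paper does not report the value actually used.
MIN_PARSE_QUALITY = 0.5


@dataclass
class Interpretation:
    """Hypothetical container for the result of interpreting one utterance."""
    parse_quality: float  # score assigned by the parser (assumed to lie in 0..1)
    failed_checks: List[str] = field(default_factory=list)  # failed consistency checks


def accept_interpretation(interp: Interpretation) -> bool:
    """Accept only interpretations judged sufficiently reliable.

    A rejected interpretation is treated as a non-understanding and handed
    to the error recovery policy instead of being acted upon.
    """
    if interp.parse_quality < MIN_PARSE_QUALITY:
        return False
    return not interp.failed_checks
```

Under these assumptions, a well-parsed utterance that fails any
consistency check is rejected rather than risked as a misunderstanding.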

We implemented two different policies for comparison. Our baseline
policy does not attempt any remediation or error recovery. All student
utterances are passed through the standard interpretation pipeline, so
that the results can be analyzed later. However, the system does not
attempt to address the student content. Instead, regardless of the answer
analysis, the system always uses a neutral acceptance and bottom out
strategy, giving the student the correct answer every time, e.g., “OK.
One way to phrase the correct answer is: the open switch creates a gap in
the circuit”. Thus, the students are never given any indication of
whether they have been understood or not.

The full policy acts differently depending on the analysis of the
student answer. For correct answers, it acknowledges the answer as
correct and optionally restates it (see Dzikovska et al. (2008) for
details). For incorrect answers, it restates the correct portion of the
answer (if any) and provides a hint to guide the student towards the
completely correct answer. If the student’s utterance cannot be
interpreted, the system responds with a help message indicating the cause
of the problem together with a hint. In both cases, after 3 unsuccessful
attempts to address the problem the system uses the bottom out strategy
and gives away the answer. The content of the bottom out is the same as
in the baseline, except that the full system indicates clearly that the
answer was incorrect or was not understood, e.g., “Not quite. Here is the
answer: the open switch creates a gap in the circuit”.

1 While there is no confidence score from a speech recognizer, our
system uses a combination of a parse quality score assigned by the parser
and a set of consistency checks to determine whether an interpretation is
sufficiently reliable.
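
The contrast between the two policies can be summarized in a short
sketch. This is a minimal illustration under stated assumptions, not the
system's code: the Analysis labels, the function names, the "Correct."
acknowledgement and the reading of "after 3 unsuccessful attempts" as
"bottom out on the third failure" are all choices made here. The two
bottom-out wordings are taken from the examples in the text, and the
restatement of the correct portion of an answer is omitted.

```python
from enum import Enum


class Analysis(Enum):
    """Hypothetical labels for the outcome of analysing one student answer."""
    CORRECT = "correct"
    INCORRECT = "incorrect"
    UNINTERPRETABLE = "uninterpretable"


MAX_ATTEMPTS = 3  # the full policy bottoms out after 3 unsuccessful attempts


def baseline_response(correct_answer: str) -> str:
    """BASE: neutral acceptance plus bottom out, regardless of the analysis."""
    return f"OK. One way to phrase the correct answer is: {correct_answer}"


def full_response(analysis: Analysis, failed_attempts: int, correct_answer: str,
                  hint: str, help_message: str = "") -> str:
    """FULL: respond according to the analysis of the student answer.

    failed_attempts counts unsuccessful attempts so far, including this one
    (an assumption about how the attempts are counted).
    """
    if analysis is Analysis.CORRECT:
        return "Correct."  # placeholder acknowledgement; wording is assumed
    if failed_attempts >= MAX_ATTEMPTS:
        # Bottom out, but signal clearly that the answer was not accepted.
        return f"Not quite. Here is the answer: {correct_answer}"
    if analysis is Analysis.UNINTERPRETABLE:
        # Help message naming the cause of the problem, followed by a hint.
        return f"{help_message} {hint}".strip()
    return hint  # incorrect answer: restatement omitted, hint only
```

The key design difference is visible in the signatures: the baseline
ignores the analysis entirely, while the full policy branches on it and
reveals the answer only as a last resort.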

The help messages are based on the Targeted Help approach successfully
used in spoken dialogue (Hockey et al., 2003), together with the error
classification we developed for tutorial dialogue (Dzikovska et al.,
2009). There are 9 different error types, each associated with a
different targeted help message. The goal of the help messages is to give
the student as much information as possible as to why the system failed
to understand them but without giving away the answer.

In comparing the two policies, we would expect that the students in both
conditions would learn something, but that the learning gain and user
satisfaction would be affected by the difference in policies. We
hypothesized that students who receive feedback on their errors in the
full condition would learn more compared to those in the baseline
condition.

3 Evaluation

We collected data from 76 subjects interacting with the system. The
subjects were randomly assigned to either the baseline (BASE) or the full
(FULL) policy condition. Each subject took a pre-test, then worked
through a lesson with the system, and then took a post-test and filled in
a user satisfaction survey. Each session lasted approximately 4 hours,
with 232 student language turns in FULL (SD = 25.6) and 156 in BASE
(SD = 2.02). Additional time was taken by reading and interacting with
the simulation environment. The students had little prior knowledge of
the domain. The survey consisted of 63 questions on a 5-point Likert
scale covering the lesson content, the graphical user interface, and the
tutor’s understanding and feedback. For purposes of this study, we are
using an averaged tutor score.

The average learning gain was 0.57 (SD = 0.23) in FULL, and 0.63
(SD = 0.26) in BASE. There was no significant difference in learning gain
between conditions. Students liked BASE better: the average tutor
evaluation score for FULL was 2.56 out of 5 (SD = 0.65), compared to 3.32
(SD = 0.65) in BASE. These results are significantly different (t-test,
p < 0.05). In informal comments after the session many students said that
they were frustrated when the system said that it did not understand
them. However, some students in BASE also mentioned that they sometimes
were not sure if the system’s answer was correcting a problem with their
answer, or simply phrasing it in a different way.

We used mean frequency of non-interpretable utterances (out of all
student utterances in each session) to evaluate the effectiveness of the
two different policies. On average, 14% of utterances in both conditions
resulted in non-understandings.2 The frequency of non-understandings was
negatively correlated with learning gain in FULL: r = -0.47, p < 0.005,
but not significantly correlated with learning gain in BASE: r = -0.09,
p = 0.59. However, in both conditions the frequency of non-understandings
was negatively correlated with user satisfaction: FULL r = -0.36,
p = 0.03, BASE r = -0.4, p = 0.01. Thus, even though in BASE the system
did not indicate non-understanding, students were negatively affected.
That is, they were not satisfied with the policy that did not directly
address the interpretation problems. We discuss possible reasons for this
below.
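
The per-session frequency and the correlation coefficient reported above
can be computed as in the following sketch. It assumes that r denotes the
Pearson correlation coefficient, which the text does not state
explicitly. The function names are introduced here for illustration, and
the sample values are hypothetical, not data from the study.

```python
import math
from typing import Sequence


def error_frequency(n_errors: int, n_utterances: int) -> float:
    """Fraction of a session's student utterances that were not interpreted."""
    if n_utterances <= 0:
        raise ValueError("a session must contain at least one utterance")
    return n_errors / n_utterances


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-student values, NOT data from the study.
nonunderstanding_freq = [0.05, 0.10, 0.15, 0.20, 0.25]
learning_gain = [0.80, 0.70, 0.65, 0.50, 0.40]
r = pearson_r(nonunderstanding_freq, learning_gain)  # negative for this sample
```

A significance level such as the reported p values would additionally
require a test on r, which this sketch leaves out.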

We investigated the effect of different types of interpretation errors
using two criteria. First, we checked whether the mean frequency of
errors was reduced between BASE and FULL for each individual strategy. A
reduced frequency means that the recovery strategy for this particular
error type is effective in reducing the error frequency. Second, we
looked for the cases where the frequency of a given error type is
negatively correlated with either learning gain or user satisfaction.
This provides evidence that such errors are negatively impacting the
learning process, and therefore improving recovery strategies for those
error types is likely to improve overall system effectiveness.
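
The first criterion amounts to a two-sample comparison of per-student
error frequencies between conditions. The sketch below computes Welch's
t statistic and its approximate degrees of freedom. The paper says only
"t-test", so the choice of the unequal-variance (Welch) variant is an
assumption made here, and the frequency values are hypothetical, not
data from the study.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-student frequencies of one error type, NOT study data.
base_freq = [0.010, 0.014, 0.012, 0.016, 0.011]
full_freq = [0.006, 0.009, 0.008, 0.010, 0.007]
t_stat, dof = welch_t(base_freq, full_freq)  # t_stat > 0: errors more frequent in BASE
```

Turning the statistic into a p value requires the t distribution. A
library routine such as scipy.stats.ttest_ind with equal_var=False
performs the same test and returns the p value directly.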

The results, shown in Table 1, indicate that the majority of
interpretation problems are not significantly correlated with learning
gain. However, several types of problems appear to be particularly
significant, and are all related to improper use of domain terminology.
These were irrelevant_answer, no_appr_terms,
selectional_restriction_failure and program_error.

2 We do not know the percentage of misunderstandings or concept error
rate as yet. We are currently annotating the data with the goal to
evaluate interpretation correctness.

                           FULL                                  BASE
error type                 mean freq.     satisfaction  gain     mean freq.     satisfaction  gain
                           (std. dev.)    r             r        (std. dev.)    r             r
irrelevant_answer          0.008 (0.01)   -0.08         -0.19    0.012 (0.01)   -0.07         -0.47**
no_appr_terms              0.005 (0.01)   -0.57**       -0.42**  0.003 (0.01)   -0.38**       -0.01
selectional_restr_failure  0.032 (0.02)   -0.12         -0.55**  0.040 (0.03)   0.13          0.26*
program_error              0.002 (0.003)  0.02          0.26     0.003 (0.003)  0             -0.35**
unknown_word               0.023 (0.01)   0.05          -0.21    0.024 (0.02)   -0.15         -0.09
disambiguation_failure     0.013 (0.01)   -0.04         0.02     0.007 (0.01)   -0.18         0.19
no_parse                   0.019 (0.01)   -0.14         -0.08    0.022 (0.02)   -0.3*         0.01
partial_interpretation     0.004 (0.004)  -0.11         -0.01    0.004 (0.005)  -0.19         0.22
reference_failure          0.012 (0.02)   -0.31*        -0.09    0.017 (0.01)   -0.15         -0.23
Overall                    0.134 (0.05)   -0.36**       -0.47**  0.139 (0.04)   -0.4**        -0.09

Table 1: Correlations between frequency of different error types and
student learning gain and satisfaction. ** - correlation is significant
with p < 0.05, * - with p <= 0.1.

An irrelevant_answer error occurs when the student makes a statement
that uses domain terminology but does not appear to answer the system's
question directly. For example, the expected answer to “In circuit 1,
which components are in a closed path?” is “the bulb”. Some students
misread the question and say “Circuit 1 is closed.” If that happens, in
FULL the system says “Sorry, this isn't the form of answer that I
expected. I am looking for a component”, pointing out to the student the
kind of information it is looking for. The BASE system for this error,
and for all other errors discussed below, gives away the correct answer
without indicating that there was a problem with interpreting the
student’s utterance, e.g., “OK, the correct answer is the bulb.”

The no_appr_terms error happens when the student is using terminology
inappropriate for the lesson in general. Students are expected to learn
to explain everything in terms of connections and terminal states. For
example, the expected answer to “What is voltage?” is “the difference in
states between two terminals”. If instead the student says “Voltage is
electricity”, FULL responds with “I am sorry, I am having trouble
understanding. I see no domain concepts in your answer. Here’s a hint:
your answer should mention a terminal.” The motivation behind this
strategy is that in general, it is very difficult to reason about vaguely
used domain terminology. We had hoped that by telling the student that
the content of their utterance is outside the domain as understood by the
system, and hinting at the correct terms to use, the system would guide
students towards a better answer.

Selectional_restr_failure errors are typically due to incorrect
terminology, when the students phrased answers in a way that contradicted
the system's domain knowledge. For example, the system can reason about
damaged bulbs and batteries, and open and closed paths. So if the student
says “The path is damaged”, the FULL system would respond with “I am
sorry, I am having trouble understanding. Paths cannot be damaged. Only
bulbs and batteries can be damaged.”

Program_error errors were caused by faults in the underlying network
software, but usually occurred when the student was using extremely long
and complicated utterances.
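
The targeted help messages quoted above can be pictured as a lookup from
error type to a message template. The sketch below is a hypothetical
organization, not the system's generator: the dictionary, the slot names
and the generic fallback are assumptions introduced here, while the
message wording follows the three examples given in the text.

```python
from typing import Dict

# Wording follows the examples quoted in the text; slot names are assumed.
HELP_TEMPLATES: Dict[str, str] = {
    "irrelevant_answer": (
        "Sorry, this isn't the form of answer that I expected. "
        "I am looking for {expected}."
    ),
    "no_appr_terms": (
        "I am sorry, I am having trouble understanding. "
        "I see no domain concepts in your answer. "
        "Here's a hint: your answer should mention {expected}."
    ),
    "selectional_restr_failure": (
        "I am sorry, I am having trouble understanding. "
        "{subject} cannot be {predicate}. Only {allowed} can be {predicate}."
    ),
}


def targeted_help(error_type: str, **slots: str) -> str:
    """Fill the help template for an error type with context-specific slots."""
    template = HELP_TEMPLATES.get(error_type)
    if template is None:
        # Assumed generic fallback for error types without a template here.
        return "I am sorry, I am having trouble understanding."
    return template.format(**slots)
```

For example, targeted_help("selectional_restr_failure", subject="Paths",
predicate="damaged", allowed="bulbs and batteries") reproduces the
response to “The path is damaged” quoted above.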

Out of the four important error types described above, only the strategy
for irrelevant_answer was effective: the frequency of irrelevant_answer
errors is significantly higher in BASE (t-test, p < 0.05), and it is
negatively correlated with learning gain in BASE. The frequencies of
other error types did not significantly differ between conditions.

However, one other finding is particularly interesting: the frequency of
no_appr_terms errors is negatively correlated with user satisfaction in
BASE. This indicates that simply accepting the student’s answer when they
are using incorrect terminology and exposing them to the correct answer
is not the best strategy, possibly because the students are noticing the
unexplained lack of alignment between their utterance and the system’s
answer.

4 Discussion and Future Work

As discussed in Section 1, previous studies of short-answer tutorial
dialogue systems produced a counter-intuitive result: measures of
interpretation accuracy were not correlated with learning gain. With less
restricted language, misunderstandings negatively affected learning. Our
study provides further evidence that interpretation quality significantly
affects learning gain in tutorial dialogue. Moreover, while it has long
been known that user satisfaction is negatively correlated with
interpretation error rates in spoken dialogue, this is the first attempt
to evaluate the impact of different types of interpretation errors on
task success and usability of a tutoring system.

Our results demonstrate that different types of errors may matter to a
different degree. In our system, all of the error types negatively
correlated with learning gain stem from the same underlying problem: the
use of incorrect or vague terminology by the student. With the exception
of the irrelevant_answer strategy, the targeted help strategies we
implemented were not effective in reducing error frequency or improving
learning gain. Additional research is needed to understand why. One
possibility is that irrelevant_answer was easier to remediate compared to
other error types. It usually happened in situations where there was a
clear expectation of the answer type (e.g., a list of component names, a
yes/no answer). Therefore, it was easier to design an effective prompt.
Help messages for other error types were more frequent when the expected
answer was a complex sentence, and multiple possible ways of phrasing the
correct answer were acceptable. Therefore, it was more difficult to
formulate a prompt that would clearly describe the problem in all
contexts.

One way to improve the help messages may be to have the system indicate
more clearly when user terminology is a problem. Our system apologized
each time there was a non-understanding, leading students to believe that
they may be answering correctly but the answer is not being understood. A
different approach would be to say something like “I am sorry, you are
not using the correct terminology in your answer. Here’s a hint: your
answer should mention a terminal”. Together with an appropriate mechanism
to detect paraphrases of correct answers (as opposed to vague answers
whose correctness is difficult to determine), this approach could be more
beneficial in helping students learn. We are considering implementing and
evaluating this as part of our future work.

Some of the errors, in particular instances of no_appr_terms and
selectional_restr_failure, also stemmed from unrecognized paraphrases
with non-standard terminology. Those answers could conceivably be
accepted by a system using semantic similarity as a metric (e.g., using
LSA with pre-authored answers). However, our results also indicate that
simply accepting the incorrect terminology may not be the best strategy.
Users appear to be sensitive when the system’s language does not align
with their terminology, as reflected in the decreased satisfaction
ratings associated with higher rates of incorrect terminology problems in
BASE. Moreover, prior analysis of human-human data indicates that tutors
use different restate strategies depending on the “quality” of the
student answers, even if they are accepting them as correct (Dzikovska
et al., 2008). Together, these point at an important unaddressed issue:
existing systems are often built on the assumption that only incorrect
and missing parts of the student answer should be remediated, and a wide
range of terminology should be accepted (Graesser et al., 1999; Jordan
et al., 2006). While it is obviously important for the system to accept a
range of different phrasings, our analysis indicates that this may not be
sufficient by itself, and students could potentially benefit from
addressing the terminology issues with a specifically devised strategy.

Finally, it is also possible that some differences between strategy
effectiveness were caused by incorrect error type classification. Manual
examination of several dialogues suggests that most of the errors are
assigned to the appropriate type, though in some cases incorrect
syntactic parses resulted in unexpected interpretation errors, causing
the system to give a confusing help message. These misclassifications
appear to be evenly split between different error types, though a more
formal evaluation is planned in the future. However, from our initial
examination, we believe that the differences in strategy effectiveness
that we observed are due to the actual differences in the help messages.
Therefore, designing better prompts would be the key factor in improving
learning and user satisfaction.